HOW WE SHOULD MEASURE "CHANGE"— OR SHOULD WE? 1 

LEE J. CRONBACH 2 and LITA FURBY 3 
Stanford University 

Procedures previously recommended by various authors for the estimation of 
"change" scores, "residual" or "basefree" measures of change, and other kinds of 
difference scores are examined. A procedure proposed by Lord is extended to obtain 
more precise estimates, and an alternative to the Tucker-Damarin-Messick pro- 
cedure is offered. A consideration of the purposes for which change measures have 
been sought in the past leads to a series of recommended procedures which solve re- 
search and personnel-decision problems without estimation of change scores for 
individuals. 



A persistent puzzle in psychometrics has 
been "the measurement of change." Many in- 
vestigators have felt, for reasons good or bad, 
that their substantive questions required a 
measure of gain in ability or shift in attitude. 
"Raw change" or "raw gain" scores formed by 
subtracting pretest scores from posttest scores 
lead to fallacious conclusions, primarily be- 
cause such scores are systematically related to 
any random error of measurement. Although 
the unsuitability of such scores has long been 
discussed, they are still employed, even by 
some otherwise sophisticated investigators. 

At the end of this paper the authors argue 
that gain scores are rarely useful, no matter 
how they may be adjusted or refined. The 
authors also distinguish four kinds of inquiry 
for which such scores have been used, and con- 
clude that only one of these purposes is well 
served by any kind of gain score. This argu- 
ment applies not only to changes over time, 
but also to other differences between two vari- 
ables. 

The first part of the paper proposes superior 
ways of estimating true change and true resid- 
ual change scores. It may seem pointless to 
discuss such matters when in the end we rec- 
ommend against their use (save in a few kinds 
of investigation). However, the development 
of formulas clarifies the model, providing a 
base for the final recommendations, and allows 
us to explain the limitations of previous papers 
on the subject. Furthermore, it develops supe- 
rior estimators for the kinds of problem where 
we do recommend using a measure of change. 
Very likely some investigators will decide to 
obtain change or difference scores, even for 
problems where we consider such measures in- 
appropriate. Such a person will often find one 
of our estimation formulas better than those 
now suggested in the literature. 

Methods of handling data from successive 
measurements have been offered by several 
writers (Harris, 1963). We shall be particularly 
interested in the related proposals of Lord 
(1956, 1958, 1963) and McNemar (1958), who 
estimate an individual's "true change"; their 
approach is clearly superior to conventional 
techniques. This paper extends the Lord- 
McNemar reasoning to get a still better 
estimate. 

DuBois (1957) and other investigators rec- 
ommend a "residual gain" score as a substitute 
for the "raw gain" score. A gain is residualized 
by expressing the posttest score as a deviation 
from the posttest-on-pretest regression line. 
The part of the posttest information that is 
linearly predictable from the pretest is thus 
partialled out. Tucker, Damarin, and Messick 
(1966) draw attention to the "true residual 
gain," which they refer to as a "basefree mea- 
sure of change." 

FORMULATION 

The Lord-McNemar argument considers
measures X and Y, obtained by applying the
same operation to the subject on two occasions.
The subject has an observed difference score
$D = Y - X$. Scores $X_\infty$ and $Y_\infty$, representing
the person's "true" status at these times,
are postulated. The "true difference" $D_\infty$ equals
$Y_\infty - X_\infty$. The key topic of the McNemar
paper and the Lord papers is the determination
of regression coefficients for an estimator of the
form:

$$\hat{D}_\infty = \beta_1 X + \beta_2 Y + \text{constant} \qquad [1]$$

As Lord has written more extensively than 
McNemar on this matter, we shall refer only 
to Lord hereafter, though many of the com- 
ments also apply to what McNemar has said. 

The development holds so long as X and Y
are referred to the same numerical scale. That
is to say, the investigator must be willing to
say what score on Y (or $Y_\infty$) is comparable to a
given score on X (or $X_\infty$). The relation must be
reciprocal in the sense that, if the given X is
mapped into a certain Y, that value of Y is
mapped into the given X. It is particularly
important to note that this formulation side- 
steps the philosophically troublesome question, 
Are pretest and posttest "measuring the same 
variable"? A common metric is the only re- 
quirement. Even this requirement is dispensed 
with when we turn to a regression estimate of 
outcome. 

The model applies to any two measures that 
can be sensibly expressed on the same numeri- 
cal scale. This might be, for example, a stan- 
dard-score scale or an age-equivalence scale. 
The data might be ratings of two distinct traits 
expressed on the same reference scale. Hence 
statements made about change scores can be 
extended to any kind of difference score. 
Among the more famous lines of research em- 
ploying difference scores are work on "over- 
achievement," "empathy and insight," "self- 
concept" versus "ideal self," and "differential 
aptitudes." Indices of the same psychometric 
character are also involved in studies of re- 
tention and transfer. 

Assumptions. There are two variables, X and
Y. For each variable, the expected value over
independent observations of the same person
defines a true score: $E(X) = X_\infty$ and $E(Y) = Y_\infty$.
Then $D_\infty = Y_\infty - X_\infty$. The investigator
who identifies Y with X as "the same operation"
must keep the true scores distinct. In a
study of change, $X_\infty$ is the person's average
score over measurements of X that might be
made at Time 1; $Y_\infty$ is the average over
observations by the same procedure at Time 2.



Errors are uncorrelated with true scores.
But we do not assume that all correlations $\rho_{XY}$
are equal. Sometimes X and Y observations are
"linked," as when the two scores are obtained
from a single test or battery administered at 
one sitting, or when observations on different 
occasions are made by the same observer. The 
correlation between linked observations will 
ordinarily be higher than that between inde- 
pendent observations. This distinction has not 
been made in the literature on change, but it 
does appear in a paper by Stanley (1967) on 
difference scores. 

We develop a formal mathematical model 
so that some parts of our argument can be ex- 
plicit rather than intuitive. We retain the 
classical concept of strictly parallel observa- 
tions, but modify the classical concept of in- 
dependence, as suggested above. 

1. Let $X_{pi}$ represent the observed score of
person p on the X variable observed under
condition i. The condition i may be, for example,
a particular form of the test that measures
X. In studies where there are several
sources of "error" such as test form, observer,
and short-term fluctuations in the state of the
person, we assume that these sources are completely
confounded in the design used to determine
reliability coefficients and correlations.

2. Where the classical theory considers observed
score as the sum of true score and error, we
introduce two random errors:
$X_{pi} = X_{\infty p} + e_{pi} + f_{pi}$.
Likewise, $Y_{pi} = Y_{\infty p} + \varepsilon_{pi} + \zeta_{pi}$.
This is required to formalize the concept of
independence adequately.

If, as in the classical concept of parallel measures,
one assumes that the several measures of
X (or Y) are strictly interchangeable, there
can be no linkage. Machinery for explicitly defining
independence can be developed with
some concepts from generalizability theory.
There is a universe of possible conditions of
observation of X. (While observations may
vary with respect to test form, occasion, observer,
etc., we shall avoid here the complications
of multifacet theory [Gleser, Cronbach,
& Rajaratnam, 1965]. That theory recognizes,
as this paper does not, that one may have two
measurements with the same test form on
different occasions, or two measurements on
different test forms on the same occasion, or
two measurements where both form and occasion
differ.) There is a universe of observations
of Y, and we shall assume that these may
be made under the same set of conditions i that
are used for X observations. Then observations
$X_i$ and $Y_i$ made under the same condition i are
said to be linked, and observations $X_i$ and $Y_{i'}$
made under different conditions are said to be
independent.

Formally, the model specifies that conditions
of observation are drawn from the universe.
If a single i is drawn and used to obtain both
scores $X_{pi}$ and $Y_{pi}$ for person p, we have linkage;
if i and i' are drawn independently to give
scores $X_{pi}$ and $Y_{pi'}$, we have independence.
Errors are random and independent, except
that when X and Y are both observed under
condition i, the $f$ and $\zeta$ components are sampled
simultaneously. It will not be true, in general,
that $\sigma(f_{pi}, \zeta_{pi}) = 0$. The following assumptions
are made regarding all $e$ components: Their
mean over persons is zero for every condition;
their variance over persons is the same for
every condition; their intercorrelations with
other components are zero. The $\varepsilon$ components
satisfy the same assumptions, and
$\sigma(e_{pi}, \varepsilon_{pi}) = \sigma(f_{pi}, \varepsilon_{pi}) = 0$.
With regard to the $f$ components,
zero means, equal variances, and zero
correlations with true scores are assumed
(likewise for $\zeta$). Also, $\sigma(f_{pi}, \zeta_{pi'}) = 0$ for all
pairs where i is not identical to i'.

3. The measures of X made under different 
conditions are parallel. It follows from the as- 
sumptions above that the measures have equal 
means, equal variances, and equal intercor- 
relations. The same is true for Y measures. 

4. It follows, now, that

$$\sigma(X_{pi}, Y_{pi'}) = \sigma(X_{\infty p}, Y_{\infty p}),$$

hence this covariance is the same for all
independent observations of X and Y.
For linked observations the covariance is

$$\sigma(X_{pi}, Y_{pi}) = \sigma(X_{\infty p}, Y_{\infty p}) + \sigma(f_{pi}, \zeta_{pi}).$$

We assume that $\sigma(f_{pi}, \zeta_{pi})$ is the same for all i,
hence that the linked covariance is the same
for all linked X, Y observations. The covariance
of $f$ with $\zeta$ may be large or small, depending
on the extent to which the condition i
influences the X and Y performances.

We shall simplify and compress notation in
several convenient ways. We write $\rho_{XX'}$ for a
reliability coefficient, and $\rho_{XY}$ for a correlation.

For emphasis, when linked observations are
correlated or used to form a difference score we
may refer to $X_a$ and $Y_a$; their covariance and
correlation we shall designate ${}^{\bullet}\sigma_{XY}$ and ${}^{\bullet}\rho_{XY}$.
We shall similarly write $X_a$ and $Y_b$ (or the like)
for a pair of independently observed scores, and
for the covariance and correlation shall write
${}^{\circ}\sigma_{XY}$ and ${}^{\circ}\rho_{XY}$.

Ordinarily, $|{}^{\bullet}\rho_{XY}| > |{}^{\circ}\rho_{XY}|$. It is possible
that ${}^{\bullet}\rho_{XY} < {}^{\circ}\rho_{XY}$, where $X_a$ and $Y_a$ have some
complementary relation. For instance, if rate
of reading and comprehension are measured on
the same selections simultaneously, one score is
likely to rise at the expense of the other and
$\sigma(f_{pi}, \zeta_{pi})$ will be negative.

5. It is assumed that the population parameters
are known. All parameters considered
must be for the same population. 

A regression estimate of a true score is usu- 
ally improved if the data permit one to derive 
an equation for a subpopulation — for example, 
for ninth-grade boys in a certain school rather 
than for the national ninth-grade population. 
The investigator using actual data will often 
have made no reliability study on his sample. 
If he uses a reliability coefficient or a value of
$\rho_{XY}$ from a published study, he must adjust
it to take into account the variance of his
sample.
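
The linkage model in assumptions 2 to 4 above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the normal distributions, the variances, and the choice to give X and Y an identical condition component (so that the covariance of f with the corresponding Y error is positive) are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
import random


def simulate_scores(n_persons=20000, seed=0):
    """Simulate X_a, Y_a (linked) and Y_b (independent of X_a).

    All distributions and variances are illustrative assumptions.
    """
    rng = random.Random(seed)
    x_a, y_a, y_b = [], [], []
    for _ in range(n_persons):
        x_true = rng.gauss(0.0, 1.0)                  # X_infinity
        y_true = 0.5 * x_true + rng.gauss(0.0, 0.5)   # Y_infinity; true covariance 0.5
        shared_a = rng.gauss(0.0, 0.5)  # condition-a component common to X and Y (cov 0.25)
        shared_b = rng.gauss(0.0, 0.5)  # component of a separately drawn condition b
        x_a.append(x_true + rng.gauss(0.0, 0.5) + shared_a)
        y_a.append(y_true + rng.gauss(0.0, 0.5) + shared_a)
        y_b.append(y_true + rng.gauss(0.0, 0.5) + shared_b)
    return x_a, y_a, y_b


def covariance(u, v):
    """Covariance of two equal-length lists (divisor n)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n


x_a, y_a, y_b = simulate_scores()
linked_cov = covariance(x_a, y_a)        # expected near 0.5 + 0.25
independent_cov = covariance(x_a, y_b)   # expected near 0.5
print(round(linked_cov, 2), round(independent_cov, 2))
```

Under these assumed values the linked covariance exceeds the independent one by roughly the covariance of the shared condition components, which is the distinction the notation in this section is designed to keep visible.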

Reliability of a Difference Score 

As Stanley (1967) pointed out, linkage must
be taken into account in defining a reliability
coefficient for differences. Classical theory,
ignoring linkage, defines a reliability

$$\rho_{DD'} = \frac{\sigma^2_{D_\infty}}{\sigma^2_D} = \frac{\sigma^2_X \rho_{XX'} + \sigma^2_Y \rho_{YY'} - 2\sigma_X \sigma_Y \rho_{XY}}{\sigma^2_X + \sigma^2_Y - 2\sigma_X \sigma_Y \rho_{XY}} \qquad [2]$$

where $D = Y - X$. The reliability coefficients
$\rho_{XX'}$ and $\rho_{YY'}$ are correlations of independent
observations. Likewise, because of the independence
assumption, classical theory can only
interpret $\rho_{XY}$ as what we have called ${}^{\circ}\rho_{XY}$.

The reliability of a difference score is defined
as the correlation of the score with an independently
observed difference. Unlike the classical
papers, Stanley distinguishes $\rho_{(Y_a - X_a)(Y_b - X_b)}$
from $\rho_{(Y_a - X_b)(Y_c - X_d)}$. These are distinct kinds of
reliability, one for a difference of linked X and
Y and one for a difference of independent X
and Y.

If the observed difference is $D_{ab} = Y_a - X_b$
(i.e., if X and Y are experimentally independent),
the covariance with an independently
observed difference $D_{cd}$ is

$$\sigma^2_X \rho_{XX'} + \sigma^2_Y \rho_{YY'} - 2\sigma_X \sigma_Y\,{}^{\circ}\rho_{XY}. \qquad [3]$$

Even if $D_{aa} = Y_a - X_a$ (i.e., X and Y are experimentally
linked), the covariance with a
similar but independently observed linked
difference $D_{bb} = Y_b - X_b$ is the same. Hence
Formula 3 is the appropriate numerator for
Formula 2 in both cases.

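The variance of D for the independent case is

$$\sigma^2_X + \sigma^2_Y - 2\sigma_X \sigma_Y\,{}^{\circ}\rho_{XY}. \qquad [4]$$

For the linked case, however, $D_{aa} = Y_a - X_a$
and the variance equals

$$\sigma^2_X + \sigma^2_Y - 2\sigma_X \sigma_Y\,{}^{\bullet}\rho_{XY}. \qquad [5]$$

Hence for the linked case the reliability coefficient
$\rho_{(Y_a - X_a)(Y_b - X_b)}$ is obtained by dividing
Formula 3 by Formula 5:

$${}^{\bullet}\rho_{DD'} = \frac{\sigma^2_X \rho_{XX'} + \sigma^2_Y \rho_{YY'} - 2\sigma_X \sigma_Y\,{}^{\circ}\rho_{XY}}{\sigma^2_X + \sigma^2_Y - 2\sigma_X \sigma_Y\,{}^{\bullet}\rho_{XY}} \qquad [6]$$

whereas for the independent case the reliability
coefficient $\rho_{(Y_a - X_b)(Y_c - X_d)}$ is Formula 3 divided
by Formula 4:

$${}^{\circ}\rho_{DD'} = \frac{\sigma^2_X \rho_{XX'} + \sigma^2_Y \rho_{YY'} - 2\sigma_X \sigma_Y\,{}^{\circ}\rho_{XY}}{\sigma^2_X + \sigma^2_Y - 2\sigma_X \sigma_Y\,{}^{\circ}\rho_{XY}} \qquad [7]$$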

Since ${}^{\circ}\rho_{XY} < {}^{\bullet}\rho_{XY}$ in most instances, ${}^{\circ}\rho_{DD'}$
for the independent case will most likely be
smaller than ${}^{\bullet}\rho_{DD'}$ for the linked case. Both
reliability coefficients are meaningful. They
describe the correlation between differences
observed according to different experimental
designs. Distinctions like that between Equations
6 and 7 have to be made in considering
the reliability of any composite, weighted or
unweighted.

In the discussion that follows formulas are 
written in terms of the linked case; the sub- 
stitution to fit the fully independent case will 
be obvious. 
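
The two reliability coefficients of Formulas 6 and 7 can be computed side by side. The following is a minimal sketch; the function name, argument names, and the numerical values are hypothetical and serve only to show how the linked and independent correlations enter the numerator and denominator.

```python
def difference_reliability(sd_x, sd_y, rel_x, rel_y, rho_indep, rho_linked=None):
    """Reliability of D = Y - X.

    rho_indep  : correlation of independently observed X and Y
    rho_linked : correlation of linked X and Y; if None, the fully
                 independent case (Formula 7) is computed, otherwise
                 the linked case (Formula 6).
    """
    # Numerator (Formula 3) always uses the independent correlation.
    numerator = sd_x**2 * rel_x + sd_y**2 * rel_y - 2.0 * rho_indep * sd_x * sd_y
    # Denominator is the variance of the observed difference (Formula 4 or 5).
    rho_observed = rho_indep if rho_linked is None else rho_linked
    denominator = sd_x**2 + sd_y**2 - 2.0 * rho_observed * sd_x * sd_y
    return numerator / denominator


# Hypothetical parameter values, chosen only for illustration.
independent = difference_reliability(1.0, 1.0, 0.8, 0.8, rho_indep=0.5)
linked = difference_reliability(1.0, 1.0, 0.8, 0.8, rho_indep=0.5, rho_linked=0.6)
print(independent, linked)
```

With these assumed values the linked coefficient is the larger of the two, because the higher linked correlation shrinks the variance of the observed difference while the numerator is unchanged.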

ESTIMATORS OF TRUE CHANGE 

Primitive Formulas 

Among the possible estimators of $D_\infty$ are
three simple formulas.

Raw gain. The simplest formula is:

$$D = Y - X \qquad [8]\ [1]$$

Correction by simple regression for error in X.
If $\hat{X}_\infty$ is a regression estimate of $X_\infty$ from X,

$$\hat{D}_\infty = Y - \hat{X}_\infty = Y - \rho_{XX'} X + \text{constant} \qquad [9]\ [2]$$

In this and all other equations through Equation
20, the constant is one that makes the
mean (over persons) of the estimates equal to
the mean raw gain. The foregoing estimate has
occasionally been suggested (e.g., Trimble &
Cronbach, 1943), but it has seen little use.
Closely related concepts appear in Lord's
comments on analysis of covariance (1960) and
in a series of Swedish papers on the effect of
schooling on intelligence (see Harnqvist, 1968).

Correction by simple regression for error in X
and in Y. To take a further step, let $\hat{Y}_\infty$ be an
estimate of $Y_\infty$ from Y. Then

$$\hat{D}_\infty = \hat{Y}_\infty - \hat{X}_\infty = \rho_{YY'} Y - \rho_{XX'} X + \text{constant} \qquad [10]\ [3]$$

This procedure does not take the X, Y correlation
into account.

The Lord Procedure 

Lord pointed out that unless $\rho_{X_\infty Y_\infty} = 0$,
both X and Y yield information about $X_\infty$. A
multiple regression procedure can be used to
obtain

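$$\hat{X}_\infty = \rho_{XX'} X + \beta_{X_\infty (Y \cdot X)}\,(Y \cdot X) + (1 - \rho_{XX'})\,\bar{X} \qquad [11]$$

Here $Y \cdot X$ is a partial variate, the deviation of
Y from the value predicted by the regression of
Y on X in the population to which the other
parameters in Equation 11 apply.
For the linked case we have

$$Y_a \cdot X_a = Y_a - \frac{{}^{\bullet}\sigma_{XY}}{\sigma^2_X}(X_a - \bar{X}) - \bar{Y} \qquad [12]$$

We know that

$$\beta_{X_\infty (Y_a \cdot X_a)} = \frac{\sigma_{X_\infty (Y_a \cdot X_a)}}{\sigma^2_{(Y_a \cdot X_a)}} \qquad [13]$$

Now $\sigma_{X_\infty Y_a} = \sigma_{X_a Y_b} = {}^{\circ}\sigma_{XY}$ (not ${}^{\bullet}\sigma_{XY}$). Using
Equations 12 and 13, the numerator of $\beta$ becomes

$$\sigma_{X_\infty (Y_a \cdot X_a)} = \sigma_X \sigma_Y \left({}^{\circ}\rho_{XY} - {}^{\bullet}\rho_{XY}\,\rho_{XX'}\right) \qquad [14]$$

and the denominator becomes

$$\sigma^2_{(Y_a \cdot X_a)} = \sigma^2_Y \left(1 - {}^{\bullet}\rho^2_{XY}\right) \qquad [15]$$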
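Substituting in the regression equation, we
arrive at

$$\hat{X}_\infty = \frac{\rho_{XX'} - {}^{\bullet}\rho_{XY}\,{}^{\circ}\rho_{XY}}{1 - {}^{\bullet}\rho^2_{XY}}\,X + \frac{\sigma_X \left({}^{\circ}\rho_{XY} - {}^{\bullet}\rho_{XY}\,\rho_{XX'}\right)}{\sigma_Y \left(1 - {}^{\bullet}\rho^2_{XY}\right)}\,Y + \text{constant} \qquad [16]$$

Then

$$\hat{D}_\infty = \hat{Y}_\infty - \hat{X}_\infty \qquad [17]\ [4]$$

where the estimates on the right-hand side
come from Equation 16 and its analog for $\hat{Y}_\infty$.
Expanding the expression we have

$$\begin{aligned}
\hat{D}_\infty = \frac{1}{1 - {}^{\bullet}\rho^2_{XY}} \Big[ & \frac{1}{\sigma_Y}\left(\sigma_Y \rho_{YY'} - \sigma_Y\,{}^{\bullet}\rho_{XY}\,{}^{\circ}\rho_{XY} + \sigma_X\,{}^{\bullet}\rho_{XY}\,\rho_{XX'} - \sigma_X\,{}^{\circ}\rho_{XY}\right) Y \\
& - \frac{1}{\sigma_X}\left(\sigma_X \rho_{XX'} - \sigma_X\,{}^{\bullet}\rho_{XY}\,{}^{\circ}\rho_{XY} + \sigma_Y\,{}^{\bullet}\rho_{XY}\,\rho_{YY'} - \sigma_Y\,{}^{\circ}\rho_{XY}\right) X \Big] + \text{constant}
\end{aligned} \qquad [17a]\ [4]$$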

For independent observations, one substitutes
${}^{\circ}\rho_{XY}$ wherever ${}^{\bullet}\rho_{XY}$ appears in Equation 17a;
this is Lord's estimator of $D_\infty$. In an experiment,
all calculations must be made within a
treatment group.

Estimator 4 is as good as or better than any of
those listed ahead of it, giving a smaller mean
square of $(D_\infty - \hat{D}_\infty)$ and a larger correlation
between the estimate $\hat{D}_\infty$ and the true score $D_\infty$.
Ordinarily Estimator 3 is better than Estimator 2
and both are better than Estimator 1. Our
main point in presenting Estimators 2 and 3 is
to show the Lord estimator as an elaborated
form of a more conventional estimator. This
lays a base for the further refinement.
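
The comparison among Estimators 1 to 4 can be checked numerically. The sketch below computes the weights each estimator places on X and Y for the linked case, and the correlation of each estimate with the true change. The function names and the parameter values are hypothetical; the correlation formula follows from the covariances of observed scores with true scores under the model stated earlier, and is an illustration rather than a formula given in the text.

```python
from math import sqrt


def estimator_weights(sd_x, sd_y, rel_x, rel_y, rho_indep, rho_linked):
    """Return {estimator number: (weight on X, weight on Y)} for linked X, Y."""
    shrink = 1.0 - rho_linked ** 2
    # Weights of Estimator 4, taken from the expanded form (Equation 17a).
    lord_y = (sd_y * rel_y - sd_y * rho_linked * rho_indep
              + sd_x * rho_linked * rel_x - sd_x * rho_indep) / (sd_y * shrink)
    lord_x = -(sd_x * rel_x - sd_x * rho_linked * rho_indep
               + sd_y * rho_linked * rel_y - sd_y * rho_indep) / (sd_x * shrink)
    return {
        1: (-1.0, 1.0),       # raw gain
        2: (-rel_x, 1.0),     # regression correction for error in X
        3: (-rel_x, rel_y),   # regression correction for error in X and in Y
        4: (lord_x, lord_y),  # multiple-regression (Lord) estimator
    }


def correlation_with_true_change(weights, sd_x, sd_y, rel_x, rel_y, rho_indep, rho_linked):
    """Correlation between w_x * X_a + w_y * Y_a and the true change."""
    w_x, w_y = weights
    cov_indep = rho_indep * sd_x * sd_y
    cov_linked = rho_linked * sd_x * sd_y
    cov_x_true_change = cov_indep - sd_x**2 * rel_x    # cov(X_a, D_inf)
    cov_y_true_change = sd_y**2 * rel_y - cov_indep    # cov(Y_a, D_inf)
    var_true_change = sd_x**2 * rel_x + sd_y**2 * rel_y - 2.0 * cov_indep
    var_estimate = (w_x**2 * sd_x**2 + w_y**2 * sd_y**2
                    + 2.0 * w_x * w_y * cov_linked)
    cov_estimate = w_x * cov_x_true_change + w_y * cov_y_true_change
    return cov_estimate / sqrt(var_estimate * var_true_change)


# Hypothetical parameter values, chosen only for illustration.
params = dict(sd_x=1.0, sd_y=1.5, rel_x=0.9, rel_y=0.7, rho_indep=0.5, rho_linked=0.6)
weights = estimator_weights(**params)
correlations = {k: correlation_with_true_change(w, **params) for k, w in weights.items()}
for number in sorted(correlations):
    print(number, weights[number], round(correlations[number], 4))
```

For any admissible parameter set, Estimator 4 should show a correlation at least as large as the others, since it is the least-squares combination of X and Y; the ordering among Estimators 1 to 3 depends on the particular values.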

It may seem anachronistic in Formulas 11 
and 16 to use a posttest score to "predict" a 
pretest score. But the logic is clear. Within a 
treatment, persons higher on the posttest than 
others having the same observed pretest score 
tend to be those for whom the true pretest 
score is higher than the observed score. The Y
receives at least nominal weight in the regression
equation when $\rho_{XY}$ is not zero or one, and
$\rho_{XX'} < 1.00$. The weight given to Y increases
with larger $\rho_{XY}$ and smaller $\rho_{XX'}$.

Taking group membership into account. Pre- 
vious papers have implicitly assumed that all 
persons come from a single population, but 
often there are several distinct subgroups. 
These groups may be distinguished by demographic
characteristics or by past experience,
or they may be groups receiving different
treatments between the X and Y observations.
One could pool all groups and determine a
single within-group value for each parameter in
the equations above, but parameters calculated
within subgroups will give a better estimate
of $X_\infty$, $Y_\infty$, and $D_\infty$. There is a limit to
how far subdivision of samples can profitably
be carried, however.

Correlations and Regression Slopes for $D_\infty$

Sometimes an investigator wishes to know
the correlation of $D_\infty$ with another variable,
say Q. He should note, then, that the correlation
of $\hat{D}_\infty$ with Q is not a sound estimate of the
correlation of $D_\infty$ with Q. The correlation should
be determined directly from the covariance of
$D_\infty$ with the second variable of interest. This
covariance takes a form such as that given in
Equation 18.

To get the correlation coefficient one divides by
$\sigma_Q$ and $\sigma_{D_\infty}$ (not $\sigma_{\hat{D}_\infty}$). Since $D_\infty$ equals $Y_\infty - X_\infty$,
its variance is given by Equation 3. All parameters
must be those for the same group.

The investigator is often interested in the regression
of $D_\infty$ (or $Y_\infty$) on another variable.
The slope of the regression of $D_\infty$ on (e.g.) $X_\infty$
is $\sigma_{D_\infty X_\infty} / \sigma^2_{X_\infty}$.

Attention must be paid to linkages. Let us
write $Q_c$ to indicate that the observation of Q
is independent of $X_a$, $Y_a$, and $Y_b$.

$$\sigma_{D_\infty Q_c} = \sigma_{Y_b Q_c} - \sigma_{X_a Q_c} = \sigma_{Y_a Q_c} - \sigma_{X_a Q_c} \qquad [18]$$

This cannot be determined from data where Q
is linked to X or Y.

A variance-covariance algorithm. A simple
computational routine can be suggested for
problems of this character. One may form a
variance-covariance matrix of observed values
as in Figure 1. Linked and independent covariances
are carefully distinguished. The matrix
may be augmented as shown, adding rows and
copying forward covariances. Reliability information
is taken into account at certain
points. Columns may be added to the matrix
also, in a symmetric manner. Thus all entries
in the $X_\infty$ row can be copied into the $X_\infty$ column.
The full extension carried out in this way
gives a square matrix, from which such values
as $\sigma_{D_\infty Q}$ and $\sigma^2_{D_\infty}$ can be read out.
[Figure 1 (matrix not reproduced): a variance-covariance matrix whose
columns are the observed scores $X_a$, $Y_a$, $Q_a$, $Y_b$, and $Q_c$.
Rows 1 to 5 hold the observed variances and covariances for $X_a$, $Y_a$,
$Q_a$, $Y_b$, and $Q_c$. The rows added later are:
Row 6, $X_\infty$: copy independent covariances corresponding to Row 1, with $\sigma^2_X \rho_{XX'}$ in place of the variance;
Row 7, $Y_\infty$: copy independent covariances corresponding to Row 4, with $\sigma^2_Y \rho_{YY'}$ in place of the variance;
Row 8, $D_{aa}$: fill in by subtracting Row 1 from Row 2;
Row 9, $D_{ab}$: fill in by subtracting Row 1 from Row 4;
Row 10, $D_\infty$: fill in by subtracting Row 6 from Row 7;
Row 11, $Q_\infty$: copy independent covariances from Row 5.]

FIG. 1. Algorithm for constructing covariances and variances. (Covariances for linked observations are identified
by the symbol ${}^{\bullet}\sigma$, and those for independent observations by ${}^{\circ}\sigma$. The broken line separates the original data from
entries added later.)
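
The row operations of the algorithm can be sketched for the simplest case, with linked observations $X_a$ and $Y_a$ as the only predictors. The function name and the numerical values below are hypothetical; the sketch forms the true-score rows from reliability information, subtracts them to get the row for true change, and solves the two normal equations directly.

```python
def true_change_weights(var_x, var_y, rel_x, rel_y, cov_linked, cov_indep):
    """Regression weights (on X_a, Y_a) for estimating true change.

    cov_linked : covariance of the linked observations X_a and Y_a
    cov_indep  : covariance of independently observed X and Y
    """
    # Row for the true score on X: covariances with X_a and Y_a.
    # The variance is multiplied by the reliability; the covariance
    # with Y_a is the independent covariance.
    row_x_true = (var_x * rel_x, cov_indep)
    # Row for the true score on Y, formed the same way.
    row_y_true = (cov_indep, var_y * rel_y)
    # Row for true change: subtract the X row from the Y row.
    cov_xa_change = row_y_true[0] - row_x_true[0]
    cov_ya_change = row_y_true[1] - row_x_true[1]
    # Solve the 2 x 2 normal equations; the predictor matrix uses the
    # linked covariance because X_a and Y_a are the observed predictors.
    determinant = var_x * var_y - cov_linked ** 2
    weight_x = (var_y * cov_xa_change - cov_linked * cov_ya_change) / determinant
    weight_y = (var_x * cov_ya_change - cov_linked * cov_xa_change) / determinant
    return weight_x, weight_y


# Hypothetical values: sd_x = 1, sd_y = 1.5, linked rho = .6, independent rho = .5.
weight_x, weight_y = true_change_weights(
    var_x=1.0, var_y=2.25, rel_x=0.9, rel_y=0.7, cov_linked=0.9, cov_indep=0.75
)
print(weight_x, weight_y)
```

The weights obtained this way should agree with those from the closed-form expansion of the Lord estimator, which is a useful check when the matrix is extended to further variables.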



A Better Estimate of True Change 

Lord's formula uses only X and Y data, but 
we shall bring in two further categories of 
variables, W and Z. The W and X are Time-1 
measures, but need not be simultaneous. Al- 
though we write W without vector notation, 
there are, in principle, any number of W vari- 
ables that can be used singly or in combination. 
Our statements apply to any W or any weighted 
composite of the W. 

A W might be any score describing the 
subject as he was prior to the treatment under 
study or W might be an index based on his life 
history. The scores Y and Z are posttreatment 
measures — again, not necessarily simultane- 
ous. The Y might, for example, be a measure of 
performance at the end of training, and Z a re- 
tention test a month later. Where we are ex- 
amining a difference score rather than a change 



score, no distinction between W and Z vari- 
ables is needed. 

The steps taken in going from Estimate 2 to
Estimate 4 above can be extended to make use
of W and Z information so as to reach an even
better estimate of $D_\infty$.

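The complete estimator. If W is univariate and
there is no Z information,

$$\hat{X}_\infty = \rho_{XX'} X + \frac{\sigma_{X_\infty (Y \cdot X)}}{\sigma^2_{(Y \cdot X)}}\,(Y \cdot X) + \frac{\sigma_{X_\infty (W \cdot X, Y)}}{\sigma^2_{(W \cdot X, Y)}}\,(W \cdot X, Y) + \text{constant} \qquad [19]$$

Here $W \cdot X, Y$ is a partial variate, the deviation
of W from the value predicted by the regression
of W on X and Y.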

If the W information is multivariate, a whole
series of partial variates enters. The order of
partialling is arbitrary; one might write terms
for X; $W \cdot X$; $Y \cdot W, X$; etc. Where there is Z
information, one adds further terms, again
employing partial variates; W, as well as X
and Y, is partialled out. The estimation procedures
must take into account any linkage
between X, Y, W, and Z variables as in Equation 16.

A similar equation is written for $\hat{Y}_\infty$.
Finally, one comes to an equation of the form

$$\hat{D}_\infty = \beta_1 X + \beta_2 Y + \beta_3 W + \beta_4 Z + \text{constant} \qquad [20]\ [5]$$

Here $\beta_3 W$ stands for a string of several terms of
the form $\beta_i W_i$ if there are many W (likewise
for $\beta_4 Z$). This estimator is superior to Formula
4, provided that the sample size is large enough
to justify assigning a large number of weights.
Where sample size is insufficient, the number of
predictors must be held down, most likely by
employing the first one or more principal components
of the W set as predictors (likewise
for Z).

When the problem becomes complicated, it
is better to use efficient computing routines
than to write out elaborate formulas. The
within-treatment covariance matrix for X, Y,
W, Z is written. Additional rows and columns
for $X_\infty$ and $Y_\infty$ are formed as in Figure 1, with
care to enter independent or linked covariances
as required. The $X_\infty$ column is subtracted from
the $Y_\infty$ column to form the $D_\infty$ column. When
the symmetric matrix is complete, one applies
a multiple-regression program, treating entries
in the $D_\infty$ column as "test-criterion covariances"
and the appropriate X, Y, W, and Z as
predictors. If the observed scores have X and
Y linked, for example, then $X_a$ and $Y_a$ are
used as predictors, and covariances for $Y_b$ are
ignored.
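
The multiple-regression step described above can be sketched with a small linear solver. In this illustration a single W variable, observed independently of X and Y, is added to the linked predictors $X_a$ and $Y_a$. All covariance values, and the function names, are hypothetical; the sketch only shows that the squared correlation between the estimate and true change cannot decrease when W is added.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def squared_multiple_correlation(predictor_cov, criterion_cov, criterion_var):
    """Return (regression weights, squared correlation with the criterion)."""
    weights = solve_linear(predictor_cov, criterion_cov)
    explained = sum(w * c for w, c in zip(weights, criterion_cov))
    return weights, explained / criterion_var


# Hypothetical covariances among X_a, Y_a (linked) and W.
cov_xyw = [
    [1.00, 0.90, 0.30],
    [0.90, 2.25, 0.60],
    [0.30, 0.60, 1.00],
]
# Covariances of X_a, Y_a, W with true change (the D column), using
# reliabilities .9 and .7 and an independent X, Y covariance of .75.
cov_with_change = [0.75 - 0.90, 2.25 * 0.7 - 0.75, 0.60 - 0.30]
var_true_change = 1.0 * 0.9 + 2.25 * 0.7 - 2 * 0.75

cov_xy = [row[:2] for row in cov_xyw[:2]]
weights_xy, r2_without_w = squared_multiple_correlation(
    cov_xy, cov_with_change[:2], var_true_change)
weights_xyw, r2_with_w = squared_multiple_correlation(
    cov_xyw, cov_with_change, var_true_change)
print(round(r2_without_w, 4), round(r2_with_w, 4))
```

In practice the gain from W must be weighed against the sampling error introduced by each additional weight, as the text notes.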

Demographic information and information 
about experience can and should be considered 
as W variables. With a variable such as sex, one 
has a choice of entering it directly as a variable 
coded 1 and 0, or of performing a separate re- 
gression analysis for each sex. Both procedures 
regress the person's score toward the mean 
for his own sex rather than toward the mean for 
all cases. The second procedure allows for the 
possibility that the regression surface for males 
differs from that for females. Separate within- 
subgroup regressions would seemingly be pre- 
ferred when samples are truly large. 



The difficulty is that the argument can be 
repeated for every other noncontinuous vari- 
able and for all combinations of them. Indeed, 
it applies to continuous variables also; for 
example, a regression surface for more anxious 
children may differ from that for the less anxi- 
ous. These remarks amount to entertaining the 
possibility of nonlinear relationships. While 
this possibility is real enough, one can rarely 
get usable estimates of nonlinear functions 
from samples of practical size (Burket, 1964; 
Goldberg, 1969). Hence, after one has divided 
the sample into a few salient subgroups, each 
having a suitably large size, the dummy- 
variable technique seems to be the only feasible 
way to handle a variety of nonquantitative 
information. 

Residualized Gains 

Developments parallel to those above lead to 
so-called "residual gains" or "basefree mea- 
sures of change." 

Alternative Estimators 

The raw residual-gain score is defined by

$$D \cdot X = Y - E(Y \mid X) = Y - \bar{Y} - \beta_{Y \cdot X}\,(X - \bar{X}) \qquad [21]\ [1']$$

We designate this Formula 1' to emphasize that
it is comparable to Formula 1, the raw gain. If
$Y_a$ and $X_a$ define the difference score, $\beta_{Y \cdot X}$
equals ${}^{\bullet}\rho_{XY}\,\sigma_Y / \sigma_X$. If $Y_b$ rather than $Y_a$ is used,
${}^{\circ}\rho_{XY}$ replaces ${}^{\bullet}\rho_{XY}$. The traditional definition
given by Equation 21 is ambiguous; the residual
$(Y_a - X_a) \cdot X_a$ is conceptually different
from the residual $(Y_b - X_a) \cdot X_a$.

Residualizing removes from the posttest 
score, and hence from the gain, the portion that 
could have been predicted linearly from pretest 
status. One cannot argue that the residualized 
score is a "corrected" measure of gain, since in 
most studies the portion discarded includes 
some genuine and important change in the per- 
son. The residualized score is primarily a way 
of singling out individuals who changed more 
(or less) than expected. 
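
The raw residual gain of Formula 1' can be computed directly from paired scores. The sketch below uses the sample regression of posttest on pretest; the data and the function name are hypothetical and are included only to show that the resulting scores have zero mean and zero covariance with the pretest.

```python
def raw_residual_gains(pretest, posttest):
    """Deviation of each posttest score from the posttest-on-pretest regression line."""
    n = len(pretest)
    mean_x = sum(pretest) / n
    mean_y = sum(posttest) / n
    sum_xx = sum((x - mean_x) ** 2 for x in pretest)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pretest, posttest))
    slope = sum_xy / sum_xx  # sample estimate of the regression slope of Y on X
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(pretest, posttest)]


# Hypothetical scores for five persons.
pretest = [10, 12, 14, 16, 18]
posttest = [13, 14, 18, 17, 22]
residuals = raw_residual_gains(pretest, posttest)
print([round(r, 2) for r in residuals])
```

A positive value marks a person whose posttest score is higher than the pretest would predict; as the text stresses, this singles out unexpected change and is not a corrected measure of gain.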

True residual gain could be defined either as
the expected value, over many observations on
the same person, of $D \cdot X$ (defined in either of
the two possible ways), or as the partial variate
$D_\infty \cdot X_\infty$, the part of the true gain not predictable
from true pretest status. $D_\infty \cdot X_\infty$ is more
likely to be of interest. Certainly if we intend
to pick out superior learners or persons whose
self-concept falls far below the self-ideal, we
would like to base the discrimination on a true-score
disparity. Likewise, if we have correlational
questions (for example, Does anxiety
predict overachievement?), the variable seems
better specified by $D_\infty \cdot X_\infty$. Hence we consider
only the definition:

$$D_\infty \cdot X_\infty = D_\infty - \beta_{D_\infty \cdot X_\infty} X_\infty + \text{constant} = Y_\infty - \beta_{Y_\infty \cdot X_\infty} X_\infty + \text{constant} \qquad [22]$$
No ambiguity arising from possible linkage of 
the observed X and Y enters this definition, 
though linkage must be considered in any esti- 
mation procedure. 

Successive estimators of true residual change 
can be constructed on the same principles as 
before, but it will suffice to move directly to the 
estimator formed by the multiple-regression 
principle in the manner of Lord: 



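$$\widehat{D_\infty \cdot X_\infty} = \frac{\rho_{XX'}\,\rho_{YY'} - {}^{\circ}\rho^2_{XY}}{\rho_{XX'}\left(1 - {}^{\bullet}\rho^2_{XY}\right)} \left( Y_a - \frac{{}^{\bullet}\rho_{XY}\,\sigma_Y}{\sigma_X}\,X_a \right) + \text{constant} \qquad [23]\ [4']$$

This and other constants in this section are defined
to make the mean of estimated residual
gain equal zero. This estimate turns out to be
proportional to the raw residual gain. (If the
estimate is made from independent Y and X,
${}^{\circ}\rho_{XY}$ replaces ${}^{\bullet}\rho_{XY}$.)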
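
The proportionality between the estimated true residual gain and the raw residual gain can be made concrete. The sketch below assumes that the multiplier takes the form $(\rho_{XX'}\rho_{YY'} - {}^{\circ}\rho^2_{XY}) / [\rho_{XX'}(1 - {}^{\bullet}\rho^2_{XY})]$, which is what the regression of the true residual on linked $X_a$ and $Y_a$ gives under the model above; the function names and numerical values are hypothetical.

```python
def residual_gain_multiplier(rel_x, rel_y, rho_indep, rho_linked):
    """Factor applied to the raw residual gain (linked case)."""
    return (rel_x * rel_y - rho_indep ** 2) / (rel_x * (1.0 - rho_linked ** 2))


def estimated_true_residual_gain(x, y, mean_x, mean_y, sd_x, sd_y,
                                 rel_x, rel_y, rho_indep, rho_linked):
    """Estimate of true residual gain for one person, with mean zero over persons."""
    slope = rho_linked * sd_y / sd_x                    # regression of Y_a on X_a
    raw_residual = (y - mean_y) - slope * (x - mean_x)  # raw residual gain
    return residual_gain_multiplier(rel_x, rel_y, rho_indep, rho_linked) * raw_residual


# Hypothetical person and population values.
fallible = estimated_true_residual_gain(
    x=12, y=20, mean_x=10, mean_y=15, sd_x=2, sd_y=3,
    rel_x=0.9, rel_y=0.7, rho_indep=0.5, rho_linked=0.6)
error_free = estimated_true_residual_gain(
    x=12, y=20, mean_x=10, mean_y=15, sd_x=2, sd_y=3,
    rel_x=1.0, rel_y=1.0, rho_indep=0.5, rho_linked=0.5)
print(round(fallible, 4), round(error_free, 4))
```

When both measures are perfectly reliable and there is no linkage, the multiplier is 1 and the estimate reduces to the raw residual gain; with fallible measures the raw residual is shrunk toward zero.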

If there is W or Z information, a still better
estimator is

$$\widehat{D_\infty \cdot X_\infty} = \beta'_1 X + \beta'_2 Y + \beta'_3 W + \beta'_4 Z + \text{constant} \qquad [24]\ [5']$$

Only in rare cases will this be proportional to
the observed $D \cdot X$.

The computational algorithm used before
can be extended to obtain the desired weights
for Formula 4' or 5'. Suppose we have filled out
the square matrix for X, Y, $X_\infty$, $Y_\infty$, $D_\infty$, ....
Then we can form the covariance of any variable
with $D_\infty \cdot X_\infty$ very simply. For example,

$$\sigma_{(D_\infty \cdot X_\infty) Q} = \sigma_{D_\infty Q} - \frac{\sigma_{D_\infty X_\infty}}{\sigma^2_{X_\infty}}\,\sigma_{X_\infty Q} \qquad [25]$$

Hence we may multiply every entry in the $X_\infty$
column of the matrix by $\sigma_{D_\infty X_\infty} / \sigma^2_{X_\infty}$, entering
this in a column to one side. Subtracting the
entry in this side column from the entry in the
corresponding row of the $D_\infty$ column gives a
covariance to be entered in a $D_\infty \cdot X_\infty$ column.
This column is now taken as a set of covariances
of predictors with the criterion, and a
multiple-regression program is applied.

The Tucker-Damarin-Messick Proposals

This analysis puts us in a position to review
and clarify the rather puzzling paper entitled
"A base-free measure of change" (Tucker,
Damarin, & Messick, 1966; hereafter referred
to as TDM). They start much as we do by
noting that the psychometrics of a change score
applies to all kinds of difference scores. They
suggest, as Lord did, that one should be most
interested in the true difference score. They
propose to divide this difference into two components,
"one entirely dependent on the true
score of the first or base-line test" and one "entirely
independent of it." That is, they are
interested in a true predicted gain and a true
residual gain. As their abstract says, "equations
for estimating both components are
given." Since we shall recommend against use
of their equations, we shall not go into details of
their argument.

It might appear that TDM are concerned
with estimating $E(D_\infty \mid X_\infty)$ and
$D_\infty - E(D_\infty \mid X_\infty)$ ($= D_\infty \cdot X_\infty$). The former, of course, is a
linear function of $X_\infty$. TDM arrive at an equation
(their Equation 26) that, in a form consistent
with the present paper, is

$$Y_a - \frac{{}^{\bullet}\rho_{XY}\,\sigma_Y}{\rho_{XX'}\,\sigma_X}\,X_a \quad \text{or} \quad Y_b - \frac{{}^{\circ}\rho_{XY}\,\sigma_Y}{\rho_{XX'}\,\sigma_X}\,X_a \quad + \text{constant} \qquad [26]$$



It is rather startling to find that this agrees
with none of our formulas. It differs from
Formula 1' in that X is replaced by $X / \rho_{XX'}$. It
differs by a further constant of proportionality
from Formula 4'. The marked departure from
Formula 4' is made the more puzzling by the
favorable references of TDM to the Lord and
McNemar papers and by their recommendation
of Formula 4 for the gain score itself.
Personal communication with the authors
verified that a failure of communication had
occurred. 4 As readers, we had given too little
weight to one key phrase: The measures are
"primarily intended for correlational work."
That is to say, TDM have no intention of interpreting
"basefree" scores for individuals. Such
scores are intended only as an intermediate
step toward correlations. TDM offer an estimator
that does not give the best least-squares
estimate of individual "basefree" scores because
they seek instead estimates that correlate
zero with $X_\infty$.

Their intention is to determine correlations 
so as to learn what kinds of person show gains 
larger than would be predicted from the true 
pretest score. Correlating the estimated true 
residual gain from Equation 23 or 24 with an- 
other variable does not give the correlation 
TDM desire (unless $\rho_{XX'}$ is 1.00). To understand this, consider
for a moment the simple correlation of Y with Q. If we do not know Y
scores but do know $\rho_{XY}$, we might estimate Y from X scores by the
usual regression equation. Then $\rho_{\hat{Y}Q}$ will not be a good
approximation to $\rho_{YQ}$; it will actually equal $\rho_{XQ}$. In general, one
who wants to interpret correlations, covari- 
ances, or regression slopes ought not to work 
from estimated scores. TDM intended to rec- 
ommend that in such a line of research one 
calculate special-purpose scores by Formula 26 
and then determine correlations. This appears 
to be unsound. TDM desire to obtain correlations with various Q of
$\gamma$ and $\zeta$, which in their notation are the residual and
predictable portions of true gain, respectively. But the TDM formulas
generate fallible values g and w; g equals $\gamma$ plus an error.
Obviously $\rho_{gQ} < \rho_{\gamma Q}$ and $\rho_{wQ} < \rho_{\zeta Q}$.
As this is not explained by
TDM, their paper is likely to mislead the 
reader. The confusion is reflected, and to some 
degree intensified, when Traub (1967) and 
Glass (1968) comment on the TDM formula. 
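
The warning against correlating estimated scores can be checked numerically. The sketch below is illustrative only: the simulated data, the variable names `x`, `y`, `q`, and the helpers `mean`, `cov`, and `corr` are assumptions introduced here, not part of the original argument. It regresses Y on X, forms the estimates, and compares the correlation of the estimates with Q against the correlation of X with Q.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def corr(a, b):
    """Pearson correlation built from the covariance helper."""
    return cov(a, b) / (cov(a, a) * cov(b, b)) ** 0.5

# Hypothetical data: Y depends on X, and Q depends on Y.
random.seed(0)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 0.8) for xi in x]
q = [0.5 * yi + random.gauss(0, 1) for yi in y]

# Usual regression estimate of Y from X.
slope = cov(x, y) / cov(x, x)
intercept = mean(y) - slope * mean(x)
y_hat = [intercept + slope * xi for xi in x]

r_yhat_q = corr(y_hat, q)   # correlation of the estimated Y with Q
r_x_q = corr(x, q)          # correlation of X with Q
r_y_q = corr(y, q)          # correlation of the actual Y with Q
print(round(r_yhat_q, 4), round(r_x_q, 4), round(r_y_q, 4))
```

Because the estimate is a linear function of X with positive slope, its correlation with Q coincides with that of X, whatever the correlation of Y with Q may be.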

4 L. R. Tucker, F. Damarin, and S. Messick, personal communication,
September 1968 and April 1969.

While one might adapt the TDM statements to get $\rho_{\gamma Q}$,
$\rho_{\zeta Q}$, etc., this is unnecessary. A straightforward
manipulation of the matrix of observed covariances for X, Y, and Q (along
with $\rho_{XX'}$ and $\rho_{YY'}$) yields the covariance of Q with
$D_\infty \cdot X_\infty$ (i.e., with $\gamma$). The
$\sigma^2_{D_\infty \cdot X_\infty}$ ($= \sigma^2_\gamma$) needed to reach
a correlation is simply the covariance of $D_\infty$ with
$D_\infty \cdot X_\infty$ that we have already obtained. To get
covariances for $D_\infty - D_\infty \cdot X_\infty$ (i.e., for $\zeta$),
one need only subtract column $D_\infty \cdot X_\infty$ of the covariance
matrix from column $D_\infty$. And
$\sigma^2_{D_\infty} = \sigma^2_\gamma + \sigma^2_\zeta$. 6
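
The manipulation just described can be sketched in a few lines. This is a minimal illustration under a stated assumption that is ours, not the text's: errors of measurement in X and Y are taken to be uncorrelated with each other, with true scores, and with Q, so that true-score covariances equal observed covariances and each true-score variance is the reliability times the observed variance. The function name `true_gain_covariances` and all numbers are hypothetical.

```python
def true_gain_covariances(var_x, var_y, cov_xy, cov_xq, cov_yq, var_q,
                          rel_x, rel_y):
    """Covariances of Q with the residual and predictable parts of true gain.

    ASSUMPTION: independent errors of measurement, so observed covariances
    carry over to true scores and true variance = reliability * variance.
    """
    var_xt = rel_x * var_x                  # variance of true pretest
    var_yt = rel_y * var_y                  # variance of true posttest
    var_d = var_yt + var_xt - 2 * cov_xy    # variance of true gain D
    cov_d_xt = cov_xy - var_xt              # cov(true gain, true pretest)
    cov_d_q = cov_yq - cov_xq               # cov(true gain, Q)

    slope = cov_d_xt / var_xt               # regression of true gain on true pretest
    cov_resid_q = cov_d_q - slope * cov_xq  # Equation 25 applied to Q
    var_resid = var_d - slope * cov_d_xt    # cov(true gain, residual) = residual variance
    cov_pred_q = cov_d_q - cov_resid_q      # gain column minus residual column
    var_pred = var_d - var_resid            # variances of the two parts add up
    corr_resid_q = cov_resid_q / (var_resid * var_q) ** 0.5
    return {
        "cov_resid_q": cov_resid_q,
        "var_resid": var_resid,
        "cov_pred_q": cov_pred_q,
        "var_pred": var_pred,
        "corr_resid_q": corr_resid_q,
    }

# Hypothetical observed moments and reliabilities.
out = true_gain_covariances(var_x=100.0, var_y=100.0, cov_xy=60.0,
                            cov_xq=20.0, cov_yq=30.0, var_q=100.0,
                            rel_x=0.8, rel_y=0.8)
print(out)
```

No score is estimated for any individual; the correlation of Q with the true residual gain comes directly from the extended covariance matrix.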

A MULTIVARIATE CONCEPTION 

The older statement of the problem as "the 
measurement of gain" or of "residual gain" 
implies a special affinity between X and Y — 
they are seen as "the same variable" in some 
sense. But change is multivariate in nature. 

Even when X and Y are determined by the 
same operation, they often do not represent the 
same psychological processes (Lord, 1958). At 
different stages of practice or development 
different processes contribute to performance 
of a task. Nor is this merely a matter of in- 
creased complexity; some processes drop out, 
some remain but contribute nothing to indi- 
vidual differences within an age group, some are 
replaced by qualitatively different processes. 
This does not rule out purely empirical studies
of changes in the operationally defined variable. 
To assess such changes, even when one cannot 
describe them qualitatively, may be practically 
important. One must be careful not to fall into 
the trap of assuming that the changes are in a 
particular psychological attribute. 

6 This matrix-extension procedure is entirely consistent with the TDM
rationale. In fact, TDM tell us that their formula was arrived at by
analytic treatment of this extended matrix.

We may illustrate by referring to Fleishman's well-known studies of
psychomotor scores at successive stages of practice (see Fleishman,
1966). On the first few trials, scores tend to correlate with cognitive
measures; the usual speculation is that the high-cognitive subjects gain
fastest because they most rapidly comprehend the instructions, display,
strategy, etc. In a second stage, certain pretests of psychomotor ability
correlate highest with scores, and the correlation of scores with
cognitive pretests becomes rather small. One could easily conclude that
cognitive ability is unimportant to learning in this second stage. But
the drop in correlation (assuming no great rise in SD) demonstrates
something more striking: that cognitive ability is negatively correlated
with change from the first to second stage. And one can suggest a good
reason. If bright persons catch on fastest, by some trial t they have
completed what can be done intellectually with the task. On trials t + 1
and later, their cognitive abilities produce no further gains. But by our
hypothesis the dull have not comprehended fully by trial t, and therefore
they have cognitive work left to do on trials t + 1, t + 2, .... During
this stage, any gain in performance that comes from improved
comprehension will be a gain by those low in cognitive aptitudes, hence
those aptitudes have a negative correlation with gains. The Fleishman
studies have not been processed in terms of gain scores, but it has been
commonly said that they indicate cognitive abilities to be "important
during the early stages of practice on psychomotor tasks," or the like.
It seems more accurate to say that cognitive abilities make their
contribution at different times for different persons and that gains at
any one time are due to different processes for different persons.
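
The step from a falling correlation to a negative correlation with change can be written out. As a sketch, let $C$ be a cognitive pretest and $Y_1$, $Y_2$ the scores at the first and second stages; these labels are introduced here for illustration only.

```latex
\sigma_{C,\,Y_2 - Y_1}
  = \sigma_{CY_2} - \sigma_{CY_1}
  = \sigma_C \left( \rho_{CY_2}\,\sigma_{Y_2} - \rho_{CY_1}\,\sigma_{Y_1} \right)
```

If $\sigma_{Y_2}$ is no larger than $\sigma_{Y_1}$ and $\rho_{CY_2} < \rho_{CY_1}$, the bracket is negative, and so is the correlation of $C$ with the gain. With unit standard deviations and hypothetical correlations of .40 and .10, for instance, the covariance is $.10 - .40 = -.30$.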

Something similar is to be said about the re- 
lation of mental age (MA) at one age to sub- 
sequent gains in MA or achievement. A positive 
relation would result insofar as the high scorers 
understand new material better, or are more 
efficient learners. But there would also be a
negative relation, insofar as the high scorers are 
those who have already mastered some highly 
valuable technique (e.g., mediation) that the 
low scorers have yet to master. As they re- 
structure their behavior, the persons with low 
scores at the start of the period may make large 
gains— gains the high scorers had previously 
made. Positive and negative elements are prob- 
ably both present, which should make us much 
less surprised than we have been by reports of 
near-zero correlation of MA (in a constant-age 
group) with subsequent gain in MA (Ander- 
son, 1939; Bloom, 1964, p. 26 ff., 62 ff.). 

We reduce emphasis on the special role of X as precursor of Y, and regard
the whole WX set as a vector describing the person's initial status. Then
one may ask how Y varies as a function of the Time-1 data. To single out
for intensive study persons who do better (or worse) than predicted, for
example, it is wise to define expected outcome as the forecast of Y on
the basis of all Time-1 information. Instead of $D_\infty \cdot X_\infty$
($= Y_\infty \cdot X_\infty$) one would estimate
$D_\infty \cdot X_\infty W_\infty$ ($= Y_\infty \cdot X_\infty W_\infty$).
The machinery suggested above for partialling out $X_\infty$ would be
used, but extended by partialling the $W_\infty$ also out of $D_\infty$.
The entire set of W and X measures constitutes the "base." There are
optional targets for investigation: $D_\infty$; $D_\infty \cdot X_\infty$;
$D_\infty \cdot X_\infty W_\infty$; etc.
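
Extending the partialling machinery to a two-variable base can be sketched as follows. The sketch assumes that the true-score covariances have already been assembled (for example, from observed covariances and reliabilities); the function name `partial_covariance` and the numbers are hypothetical.

```python
def partial_covariance(cov_dq, cov_d_base, cov_q_base, base_cov):
    """Covariance of Q with D after partialling two base variables out of D.

    cov_d_base, cov_q_base: covariances of D and of Q with the two base
    variables (true pretest and true W). base_cov: their 2 x 2 covariance
    matrix. All inputs are assumed to be true-score covariances.
    """
    (a, b), (c, d) = base_cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("base covariance matrix is singular")
    # Regression weights of D on the base: inverse(base_cov) times cov_d_base.
    w1 = (d * cov_d_base[0] - b * cov_d_base[1]) / det
    w2 = (-c * cov_d_base[0] + a * cov_d_base[1]) / det
    resid_cov = cov_dq - (w1 * cov_q_base[0] + w2 * cov_q_base[1])
    return resid_cov, (w1, w2)

# Hypothetical true-score covariances.
resid_cov, weights = partial_covariance(
    cov_dq=10.0,
    cov_d_base=(-20.0, 5.0),
    cov_q_base=(20.0, 10.0),
    base_cov=[[80.0, 20.0], [20.0, 50.0]],
)
print(resid_cov, weights)

# When W is unrelated to everything, the result reduces to Equation 25.
single_cov, _ = partial_covariance(10.0, (-20.0, 0.0), (20.0, 0.0),
                                   [[80.0, 0.0], [0.0, 50.0]])
print(single_cov)
```

The same routine applies to any other column of the extended matrix, so each optional target listed above differs only in which base variables are supplied.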

Learning or growth is multidimensional; 
many measures could be taken at each point in 
time. To select one particular Y as somehow 
integrating a variety of subcriteria is to sacrifice information and
possible insight. A person's change is better described by a vector of
true scores $W_\infty$, $X_\infty$; $Y_\infty$, $Z_\infty$. Each of these
can be estimated by the methods used in Equations 19 and 20. One who
wants to examine predicted and residual change will estimate $Y_\infty$
from $W_\infty$ and $X_\infty$, and also from all variables together. He
will obtain these scores:

$\hat{Y}_\infty \mid WXYZ$ (estimated true final status)
$\hat{Y}_\infty \mid W_\infty X_\infty$ (predicted true final status)
Difference (estimate of unpredicted true residual)

The estimates of $W_\infty$ and $X_\infty$ come from W, X, Y, and Z.
There would be estimates like those for Y for each Z or for orthogonal
components of the $Y_\infty$, $Z_\infty$ space.

We can rearrange the vectors W, X and Y, Z in a great variety of ways.
Which target to choose can be decided only in the light of the purposes
of the study.

Purposes of Estimating 
Gains or Differences 

Just why gains or differences are thought to 
be worth estimating can perhaps be inferred 
from the studies where estimates of some sort 
have been made in the past. The following aims 
may be noted : 

1. To provide a dependent variable in an 
experiment on instruction, persuasion, or some 
other attempt to change behavior or beliefs. 

2. To provide a measure of growth rate or 
learning rate that is to be predicted, as a way of 
answering the question, What kinds of persons 
grow (learn) fastest? Here, the change measure 
is a criterion variable in a correlational study. 

3. To provide an indicator of deviant de- 
velopment, as a basis for identifying individuals 
to be given special treatment or to be studied 
clinically. 

4. To provide an indicator of a construct 
that is thought to have significance in a certain 
theoretical network. The indicator may be
used as an independent variable, covariate, 
dependent variable, etc. An example is the 
interference score needed on the Stroop Color 
Word Test to represent the decline in reading 
rate when color names are printed in colors in- 
congruous with the names. 

Much of the confusion in the literature arises 
from a failure to distinguish these purposes 
and to match distinct methodological recom- 
mendations to them. 

Gains as a Consequence of Treatments 

There appears to be no need to use measures 
of change as dependent variables and no virtue 
in using them. If one is testing the null hy- 
pothesis that two treatments have the same 
effect, the essential question is whether posttest
$Y_\infty$ scores vary from group to group. Assuming
that errors of measurement of Y are random, Y 
is an entirely suitable dependent variable. 

The randomized experiment. Suppose that 
cases are assigned to treatments in a random or 
stratified-random manner. The X scores will 
vary within groups. An analysis of covariance 
to take this variation into account is advantageous so long as
$\rho_{XY}$ is large. (If $\rho < 0.4$, blocking is probably to be
preferred, according to Elashoff, 1969.) The usual adjustment estimates
the Y scores expected under the null hypothesis and then expresses each
observed Y
as a deviation from the estimate. Ordinarily it 
is desirable to base the adjustment, not on X, 
but on whatever linear combination of X and 
W best predicts Y within groups. 
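
The usual adjustment can be sketched for the simplest case. The code below assumes a single covariate X and a common slope pooled within groups; the function name `ancova_adjusted_means` and the toy data are hypothetical, and a fuller analysis would base the adjustment on the best within-group combination of X and W, as noted above.

```python
def ancova_adjusted_means(groups):
    """Pooled within-group slope of Y on X and covariance-adjusted Y means.

    groups: dict mapping a treatment label to a list of (x, y) pairs.
    """
    all_x = [x for pairs in groups.values() for x, _ in pairs]
    grand_x = sum(all_x) / len(all_x)
    sxy = sxx = 0.0
    means = {}
    for name, pairs in groups.items():
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        means[name] = (mx, my)
        # Accumulate within-group sums of products and squares.
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    # Move each group's Y mean to where it would sit at the grand mean of X.
    adjusted = {name: my - slope * (mx - grand_x)
                for name, (mx, my) in means.items()}
    return slope, adjusted

# Hypothetical pretest/posttest pairs for two randomly formed groups.
data = {
    "A": [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)],
    "B": [(2.0, 5.0), (3.0, 6.0), (4.0, 7.0)],
}
slope, adjusted = ancova_adjusted_means(data)
print(slope, adjusted)
```

The comparison of treatments rests on the adjusted posttest means; no gain score is formed for any individual.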

Where within-treatment regressions are 
linear but significantly different in slope, the 
difference between effects of treatments de- 
pends on the level of X. The "main effect" is 
not interpretable. The most meaningful report 
consists of regression functions for Y on the 
$X_\infty$, $W_\infty$ space, computed with the aid of the
covariance matrix for true scores within each 
group in turn. 

Nowhere in this section have we made use of 
a change score. We consider it likely that 
change will vary systematically with $X_\infty$.
Where this is the case, the essential result is a 
regression function, not a mean gain. The ad- 
justed Y score of the significance test is a sort 
of residual gain, but the procedure does not 



involve calculating residual gain scores for 
individuals. 

Comparison of treatment groups not formed at 
random. When treatments are applied to groups
differentiated by a nonrandom process, the $X_\infty$
distributions within the subpopulations repre- 
sented by the groups are generally not the 
same. Consequently, the same observed X 
score implies a different level of true pretest 
ability, depending on the group. 

If analysis of covariance is to be made, it is 
advisable to regress the covariate toward the 
mean of the treatment group before entering 
it in the analysis (Lord, 1960). If there is W 
information as well as X, it also contributes to 
the estimate. So does Y and Z information. 
Here is a paradox : A proposal to use the post- 
test score to estimate the pretest true score 
which will then be used to adjust posttest 
scores! The crucial point is that the estimator 
of the covariate is determined from within- 
group data. Since the estimate of any linear 
function of $X_\infty$ and $W_\infty$ has the same within-group
mean as that function of X and W, the
procedure does not introduce bias nor does it 
reduce any effect truly attributable to the 
treatment. 
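
Regressing the covariate toward the mean of its own treatment group can be illustrated with the simplest estimator. The sketch assumes a Kelley-type estimate from X alone with a known within-group reliability; the estimator recommended above would also draw on W, Y, and Z. The function name `regressed_covariate` and the numbers are hypothetical.

```python
def regressed_covariate(x, group_mean, reliability):
    """Estimate of the true pretest score, regressed toward the group mean.

    ASSUMPTION: a Kelley-type estimate using X alone and a within-group
    reliability coefficient supplied by the investigator.
    """
    return group_mean + reliability * (x - group_mean)

# The same observed score in two groups with different pretest means.
est_low_group = regressed_covariate(x=50.0, group_mean=40.0, reliability=0.8)
est_high_group = regressed_covariate(x=50.0, group_mean=60.0, reliability=0.8)
print(est_low_group, est_high_group)
```

The two estimates differ although the observed scores are identical, which is the sense in which the same X implies a different level of true pretest ability depending on the group.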

Application of analysis of covariance to 
studies where initial assignment was nonran- 
dom, which was widely recommended 10 years 
ago, is now in bad repute. Even the elaborate 
technique just suggested is no more than a 
palliative. If the treatment groups differed 
systematically at the start of the experiment 
with respect to any relevant characteristic 
other than the covariate, even a perfect mea- 
sure of the covariate cannot remove the con- 
founding. To quote Lord (1967), "there simply 
is no logical or statistical procedure that can be
counted on to make proper allowances for un- 
controlled preexisting differences between 
groups [p. 305]." And Meehl (1970) calls such 
corrections "inherently fallacious [in press]." 

The findings of the study can be usefully
summarized by calculating within-group regression
functions relating $Y_\infty$ to $X_\infty$, $W_\infty$, using
the covariance matrix for true scores. What 
cannot be done is to "compare treatment 
effects." 

One-group designs. A third kind of experiment is the simple one-group
study where one wishes to learn whether a treatment produces significant
change, or to describe the magnitude of the effect. An estimate of true
gain might appear to be pertinent. But it is not. For if one were to
estimate $D_\infty$ for each individual, and average, he would arrive
back at the sample mean of observed gain. A significance test need only
ask whether $\mu_Y$ is reliably different from $\mu_X$. The difference in
sample means for X and Y is the best available estimate of the mean
$D_\infty$.
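
Why averaging returns the observed mean gain can be seen in one line. The sketch assumes that the individual estimate takes the usual linear regression form anchored at the sample means; the weights $b_1$ and $b_2$ stand for whatever regression weights are in use and are introduced here for illustration.

```latex
\hat{D}_{\infty i} = \bar{D} + b_1 (X_i - \bar{X}) + b_2 (Y_i - \bar{Y}),
\qquad
\frac{1}{N}\sum_{i=1}^{N} \hat{D}_{\infty i}
  = \bar{D} + b_1 \cdot 0 + b_2 \cdot 0
  = \bar{Y} - \bar{X}.
```

The deviations from the sample means sum to zero, so the weights drop out of the average whatever their values.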

Criteria in Correlational Studies. 

Correlational studies are often intended to 
investigate a question such as this : Among per- 
sons with a given pretest score, what attributes 
distinguish those who profit most from the 
treatment? This may seem to ask about $\rho_{DW}$ or
$\rho_{D_\infty W_\infty}$, or perhaps about $\rho_{(D \cdot X)W}$ or
$\rho_{(D_\infty \cdot X_\infty)W}$. It is more straightforward to ask
about the regression of Y on $X_\infty$ and $W_\infty$, the corresponding
correlation for Y or $Y_\infty$, or related partial correlations. It
appears that nothing is gained by
referring to change measures in this context. 
The relationships of true scores can be investi- 
gated without estimating true scores for indi- 
viduals. 

Selecting Individuals on the Basis of Gain or 
Difference Scores. 

Many who calculate difference scores are 
interested in making decisions about individ- 
uals — identifying underachievers for clinical 
attention or fast learners for special opportuni- 
ties, for example. One can scarcely defend 
selecting such individuals on a raw-gain or 
raw-difference score, especially as these scores 
tend to show a spurious advantage for persons 
low on X. Selecting cases whose estimated $Y_\infty$ is higher than that
of others with similar $X_\infty$ and $W_\infty$ seems more sensible. To
do this, regression equations should be called into play. That is, one
selects persons for whom $\hat{Y}_\infty \mid WXYZ$ is much larger (or
smaller) than $\hat{Y}_\infty \mid W_\infty X_\infty$.

The persons with positive deviations are 
those who did better than predicted. This 
means either that they started with some valu- 
able attribute the W and X variables did not 
encompass, that their pretest true scores are 
underestimated or their posttest scores are 
overestimated, or that their success on Y was
an accidental effect arising from some tactic 
casually adopted during learning or some se- 
quence of lucky trials. It is very hard to dispose 



of the hypothesis that these unexpected gains 
were fortuitous. 

Here, the focus of attention is on an estimated residual gain: not
$D \cdot X$, not $D_\infty \cdot X_\infty$, but
$Y_\infty \cdot X_\infty W_\infty$ or, what is equivalent,
$D_\infty \cdot X_\infty W_\infty$. Where X alone is available as a
predictor, the raw residual gain selects the same persons as Formula 23
does. But Formula 24 is to be preferred.

It is possible of course, given before-and- 
after scores on the same instrument, to esti- 
mate true gains of individuals and to identify 
those who did and did not gain. But to what 
purpose? This has no clear bearing on decisions 
about the future of these persons, and the de- 
cision rule for fresh cases is to be inferred from 
the regression surface. 

Differences and Gain Scores as Constructs. 

One of the most common uses of difference 
scores is to operationalize a concept: For ex- 
ample, self-satisfaction is sometimes defined as 
the difference between the rating of self and 
ideal-self on an esteem scale. One might like- 
wise think of a gain score as reflecting "learn- 
ing ability" on a certain task. Operational 
definitions will often take the form of linear 
combinations of operations. 

But there is little a priori basis for pinning one's faith on
$Y_\infty - X_\infty$ as distinct from the more general
$Y_\infty - aX_\infty$. Just what weight to assign the "correcting"
variable is an empirical question. To arbitrarily confine interest to
$D_\infty$ (which means that a is fixed at 1.00) is to rule out possible
discoveries. This argues, then, for discovering what function of
$Y_\infty$ and $X_\infty$ has the strongest relationships with variables
that should connect with the construct.

The claim that an index has validity as a 
measure of some construct carries a consider- 
able burden of proof. There is little reason to 
believe and much empirical reason to disbelieve
the contention that some arbitrarily
weighted function of two variables will properly 
define a construct. More often, the profitable 
strategy is to use the two variables separately 
in the analysis so as to allow for complex re- 
lationships. 

One example of an "obvious" but questionable use of a subtractive
correction is provided by a study in which skin conductance is a
variable. At the start of the experiment a "base-line" measure of the
subject's galvanic skin response is taken. Then stress is applied and a
second measure is taken. During a rest period the subject receives a drug
or a placebo. Stress is again applied and a third measure taken. Call the
measures, in order, W, X, and Y. Simple correction would use $Y - W$ as
dependent variable and $X - W$ as covariate. We, however, would prefer to
use X and W as separate covariates, with Y as dependent variable. This
should give a more precise analysis when W is unreliable. (As suggested
earlier, it would generally be still better to use
$\hat{X}_\infty \mid WXY$ and $\hat{W}_\infty \mid WXY$ as covariates.)
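
The restriction hidden in the subtractive correction can be made explicit. As a sketch, write both analyses as linear models for a single subject; the intercept $a$, the weights $b$, $b_1$, $b_2$, and the error term $e$ are notation introduced here for illustration.

```latex
% Subtractive correction: Y - W as dependent variable, X - W as covariate
Y - W = a + b\,(X - W) + e
\quad\Longrightarrow\quad
Y = a + b\,X + (1 - b)\,W + e

% Separate covariates: the two weights are free
Y = a + b_1 X + b_2 W + e
```

The subtractive form is the special case $b_2 = 1 - b_1$. Entering X and W separately lets the data determine both weights instead of fixing their sum at 1.00.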

Summary 

Where true scores for individuals are de- 
sired, multiple regression procedures outlined 
herein make use of more information than do 
procedures hitherto advanced. There seems to 
be no occasion to estimate true gain scores. In 
the experiment where treatment groups are 
formed nonrandomly, estimates of true scores 
on the covariate can reduce the resulting bias. 

Where individuals who have exceptionally 
high or low residual gains are to be identified, 
the raw residual gain serves as well as the alter- 
nate formulas hitherto advanced. To estimate 
the individual's true residual gain, however, a 
superior formula is available. 

Where correlations and regression functions 
relating true gains or true residual gains to 
other variables are desired, a calculating routine 
is available that makes it unnecessary to esti- 
mate gain scores for individuals. 

It appears that investigators who ask ques- 
tions regarding gain scores would ordinarily 
be better advised to frame their questions in 
other ways. 